Abstract

With the increasing performance of microprocessors in today's personal
computers, the ability to move data quickly and efficiently to the main CPU, as
well as to a hardware-accelerated graphics controller, has become more
important than ever. This paper examines the performance impact of structuring
data for efficient processing by the main CPU by comparing the performance
characteristics of several vertex data structure arrangements used with a
matrix multiplication algorithm. It also examines the performance of
transferring data in a graphics subsystem, using the Microsoft DirectDraw API
blt functions to characterize actual and guaranteed memory throughput between
main memory, non-local video memory, and local video memory.


1. Experimental Setup 

All analyses of data structure performance were performed on an Intel
Pentium® II processor-based platform with the following configuration:

• 450 MHz processor

• Intel 440bx chipset

• 128 MB, 100 MHz SDRAM

Data transfer analysis of the hardware-accelerated graphics subsystem was
performed using two Intel Pentium® II processor-based platforms with the
following configuration:

• System 1: 400 MHz processor

• System 2: 450 MHz processor

• Intel 440bx chipset

• 256 MB, 100 MHz SDRAM
2. Main Memory to the CPU

The Intel Pentium® II processor incorporates a two-level cache system. The
primary, or L1, cache has separate 16 KB data and 16 KB instruction
memory areas. The secondary, or L2, cache is a unified 512 KB memory area.

Below the L2 cache, the Intel 440bx integrated chipset memory controller
interfaces to a 100 MHz SDRAM main memory subsystem. Memory is typically
transferred into the cache from main memory, or back to main memory from the
cache, in multiples of 32-byte cache lines.

Peak throughput from main memory is ~800 MB/sec, while peak throughput from
the L2 cache is ~1.3 GB/sec. Not only is the throughput much lower from main
memory, but the CPU pays a heavy price in clock cycles for accessing main
memory. A CPU access to the L2 cache takes ~4 CPU cycles, whereas a CPU
access to main memory requires ~39 CPU cycles. Since throughput from main
memory is only about 60% of that from the L2 cache, and the CPU must wait
nearly ten times as long for the operation to complete, the more that can be
done to keep data in the L2 cache, the less time the CPU will spend waiting
for data.


Peak throughput from the L1 cache is even higher than from the L2. However,
little can be done programmatically to keep code and data in the L1 due to a
number of factors, including its relatively small size. This paper will
therefore concentrate on ways to keep data in the L2, and to reduce the number
of main memory references required by the CPU.


The performance of reading, or writing, to a particular memory address can also
be affected by the memory type at that address. Five memory types
are supported by the Pentium® II processor [5] and are configurable via the
processor's Memory Type Range Registers (MTRRs). The most common type is
write-back (WB) memory. WB memory reads 32-byte cache lines from main
memory and uses the cache for data lookup.

Write-combined (WC, USWC) memory does not perform cache lookups, so reads go
directly to the memory source, but any writes to a single WC line are
combined in an internal CPU write-combining buffer. Because writes to a
memory line are combined, earlier writes to the same address will be lost if
the line is not evicted between writes to that address. In WC memory,
reads can also pass writes, which means that reading from WC memory may, or
may not, return the correct result. This is called non-coherent memory. The use
of WC memory causes a severe read penalty, but significantly increases the
write performance to a memory region. The WC memory type is used for most
memory areas where many writes occur, but few reads occur. Non-local video
memory (also known as AGP memory), as well as the frame buffer of most
graphics controllers, are marked WC.

The other memory types, uncacheable (UC), write-through (WT), and
write-protected (WP), are encountered less frequently.


3. Structuring Data for Higher Performance 

In order to characterize the impact of data layout on processor performance, a 
test case was created using five layouts of a vertex data structure which was 
used in a standard 4x3 matrix multiplication. 2000 of each data structure were 
allocated, then a loop was utilized to multiply each vertex by a 4x3 matrix, saving 
the result in the same structure, or in another structure. 

Two sets of tests were run for each structure. The first test utilized the Intel 
vTune 3 Pause and Resume API to drive data gathering using the Pentium® II 
CPU performance counters. Three types of data were collected using the CPU 
performance counters: 

• Data memory references (all) 

• L2 cache request misses (highly correlated) 

• L2 cache requests 

This data allows us to determine how well the cache was utilized. 

The second test used the CPU Timestamp Counter (TSC) to determine the 
amount of time each test took to complete. Both tests were executed 10 times 
per structure, and the best times were taken. 
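The best-of-ten timing described above can be sketched as follows. This is a minimal sketch rather than the original test harness: `std::chrono::steady_clock` is used as a portable stand-in for the RDTSC instruction, and `bestOfRuns` is a hypothetical helper name.

```cpp
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: runs `work` `runs` times and returns the shortest
// elapsed time in nanoseconds. std::chrono::steady_clock stands in for the
// RDTSC instruction used in the paper (an assumption made for portability).
template <typename Work>
std::int64_t bestOfRuns(Work work, int runs)
{
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int i = 0; i < runs; ++i)
    {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        const std::int64_t elapsed =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (elapsed < best)
            best = elapsed;   // keep only the fastest run
    }
    return best;
}
```

Taking the minimum rather than the mean discards runs that were disturbed by interrupts or other system activity, which is consistent with reporting best-case numbers.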

The data structures themselves were chosen based on the types of structures 
that I have implemented directly, or have encountered in other implementations. 
struct s1
{
    float x, y, z;
};

struct s3
{
    float x, y, z;
    float vx, vy, vz;
    float nx, ny, nz;
    float tu, tv;
    float r, g, b;
    float sr, sg, sb;
};

struct s4
{
    float *px;
    float *py;
    float *pz;
    float *pvx;
    float *pvy;
    float *pvz;
};

class c1
{
public:
    c1(void);

private:
    float x, y, z;
    float vx, vy, vz;
};

Each structure was used as part of a standardized loop: 

void processS1(void)
{
    s1 *psrc = new s1[2000];
    s1 *pdest = new s1[2000];
    DWORD count = 0;

    increasePriority();
    VtResumeSampling();    // or RDTSC

    do
    {
        // inline matrix multiply
        count++;
    } while (count < cVertices);

    VtPauseSampling();     // or RDTSC
    resumePriority();

    delete[] psrc;
    delete[] pdest;
}

In order to reduce system effects, the thread and process priority were raised
before each test and restored afterwards. In order to reduce cache effects
between tests, new memory was allocated at the beginning of each test, and a
large memory area was read into the cache between tests.

for (i = 0; i < 10; i++)
{
    processS1();
    invalidateCache();
}
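The paper does not show how the large memory area is read between tests. One possible sketch of `invalidateCache()` is given below; the 2 MB buffer size and the return value are assumptions chosen for illustration, not details from the original program.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of invalidateCache(): read one byte from every 32-byte
// cache line of a buffer larger than the 512 KB L2 cache, so that data left
// over from the previous test is likely to be displaced. The 2 MB buffer size
// is an assumption, not a figure from the paper.
unsigned invalidateCache()
{
    const std::size_t bufferBytes = 2 * 1024 * 1024;
    const std::size_t cacheLine = 32;           // Pentium II cache line size
    static std::vector<unsigned char> scratch(bufferBytes, 1);

    unsigned sum = 0;
    for (std::size_t i = 0; i < scratch.size(); i += cacheLine)
        sum += scratch[i];                      // each read pulls one line into the cache
    return sum;                                 // returned so the reads are not optimized away
}
```

Despite the name, nothing is explicitly invalidated here: the old contents are simply pushed out by newer lines, which is the effect the text describes.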
4. Performance Results

The tests demonstrated that the vertex structure containing only the data
required for the matrix multiplication performed significantly better than the
other structures.
Even though the total number of data references is nearly the same between the
best and worst cases, the number of L2 requests is twice as high in the worst
case (s3) as in the best case (s1). Also, there are over one hundred
times as many cache misses in the worst case (s3) as in the best case (s1).

For some cases, even though the cache performance improved, the actual
execution time increased (e.g., s4). This may be due to other factors which
were not taken into account in these tests. It should be remembered that these
are best-case numbers, since the thread and process priority were increased in
order to obtain more reliable numbers. In an actual implementation, the
performance will probably be lower.


4.1 Counting Down 

Even though there are five test structures, there are six test cases presented.
Tests s1 and s1a both used the same structure, but varied the implementation
of the loop.
processS1's implementation is shown:

void processS1(void)
{
    s1 *psrc = new s1[2000];
    s1 *pdest = new s1[2000];

    DWORD count = 0;

    s1 *p = psrc;
    s1 *d = pdest;

    do
    {
        // inline matrix multiply
        p++;
        d++;
        count++;
    } while (count < cVertices);
}

Pointer variables track the source and destination memory addresses, and an 
integer counter tracks the number of vertices that are remaining to transform. 

In processS1a, the integer counter variable is removed, and the source and
destination pointers start at the end of their respective arrays.

void processS1a(void)
{
    s1 *psrc = new s1[2000];
    s1 *pdest = new s1[2000];

    s1 *p = psrc + 2000 - 1;
    s1 *d = pdest + 2000 - 1;

    do
    {
        // inline matrix multiply
        p--;
        d--;
    } while (p >= psrc);
}

The loop ends when the source pointer drops below the start address of the
source array. This loop optimization provides a further performance gain of
~1000 CPU cycles.
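The two loop shapes can be compared side by side in a self-contained form. In this sketch a plain copy stands in for the inline matrix multiply, and the function names are hypothetical. The counting-down version is a variation on the loop described above: it starts one element past the end and decrements before each access, so the pointer never moves below the start of the array, which C++ leaves undefined.

```cpp
#include <cstddef>

struct s1 { float x, y, z; };

// Counting up: an integer counter is maintained alongside both pointers.
void copyCountingUp(const s1 *psrc, s1 *pdest, std::size_t n)
{
    std::size_t count = 0;
    const s1 *p = psrc;
    s1 *d = pdest;
    while (count < n)
    {
        *d = *p;        // stand-in for the inline matrix multiply
        ++p;
        ++d;
        ++count;
    }
}

// Counting down: the source pointer doubles as the loop counter.
void copyCountingDown(const s1 *psrc, s1 *pdest, std::size_t n)
{
    const s1 *p = psrc + n;     // one past the last element
    s1 *d = pdest + n;
    while (p != psrc)
    {
        --p;
        --d;
        *d = *p;        // stand-in for the inline matrix multiply
    }
}
```

Both functions produce identical output; only the loop bookkeeping differs. Whether the saving survives a modern optimizing compiler would need to be measured.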
4.2 Restructuring Vertex Data


This data demonstrates that a significant performance improvement can be made 
by restructuring data so that only relevant data is utilized during processing. 
Applying this to the remaining data found in a typical vertex structure, the 
following reorganization of data can take place: 































struct Vertex
{
    float x, y, z;
};

struct TVertex
{
    float vx, vy, vz;
};

struct Normal
{
    float nx, ny, nz;
};

struct Color
{
    float r, g, b;
};

struct Specular
{
    float sr, sg, sb;
};

struct TxCoord
{
    float tu, tv;
};

struct Object
{
    Vertex *pvertices;
    TVertex *ptvertices;
    Normal *pnormals;
    TxCoord *ptxcoords;
    Color *pcolors;
    Specular *pspeculars;
};

An individual vertex-based object can now be described using a structure of 
arrays of like data. This places like data in contiguous memory blocks for faster 
and more efficient cache access. 


Only during clipping operations, or reformatting for submission to an API, is all 
the data in a vertex required. During other standard operations, only two or three 
pieces of data from the vertex are required. 


Operation             Data Required

Backface cull         normal (to face), vertex
Transform vertices    vertex, transformed vertex
Lighting              color, specular, normal
Projection            transformed vertex, projected vertex
Clipping              everything, if it is clipped;
                      otherwise, only projected vertex
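As an illustration of the "Transform vertices" row, the following sketch transforms an array of vertices while touching only the Vertex and TVertex arrays. The `Matrix4x3` type, its row-major layout with the translation in the last row, and the function name are assumptions; the paper does not specify them.

```cpp
#include <cstddef>

struct Vertex  { float x, y, z; };
struct TVertex { float vx, vy, vz; };

// Assumed layout: three rows of rotation/scale followed by a translation row.
struct Matrix4x3 { float m[4][3]; };

// Transforms n vertices, reading only the Vertex array and writing only the
// TVertex array; normals, colors, and texture coordinates are never touched.
void transformVertices(const Vertex *pvertices, TVertex *ptvertices,
                       std::size_t n, const Matrix4x3 &mat)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        const Vertex &v = pvertices[i];
        TVertex &t = ptvertices[i];
        t.vx = v.x * mat.m[0][0] + v.y * mat.m[1][0] + v.z * mat.m[2][0] + mat.m[3][0];
        t.vy = v.x * mat.m[0][1] + v.y * mat.m[1][1] + v.z * mat.m[2][1] + mat.m[3][1];
        t.vz = v.x * mat.m[0][2] + v.y * mat.m[1][2] + v.z * mat.m[2][2] + mat.m[3][2];
    }
}
```

Each Vertex is 12 bytes, so consecutive vertices pack tightly into 32-byte cache lines. In the s3 layout every vertex occupies 68 bytes (17 floats), which is more than two cache lines, and most of those bytes are not needed by the transform.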



























5. Batching Processing of Data 

Many applications process a single piece of data at a time, e.g. transforming a 
single vertex. With like data batched together into contiguous memory, 
processing only a single piece of data at a time wastes the cache improvement 

that’s been made through this data restructuring. 



If only a single vertex were transformed at a time, the performance improvement 
made by restructuring this data would be lost since the improved caching of data 
wouldn’t be taken advantage of. Performing all of one process before moving on 
to the next process can significantly improve throughput to the CPU. However, 
there will be cases, based on the amount of data involved, where piecemeal 

processing will be a performance benefit. 
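The difference between batched and piecemeal processing can be sketched as follows. The per-element functions are hypothetical stand-ins for real transform and lighting work, and all names are assumptions for illustration.

```cpp
#include <cstddef>
#include <vector>

struct Vertex { float x, y, z; };
struct Color  { float r, g, b; };

// Hypothetical stand-ins for the real per-element work.
inline Vertex transformOne(const Vertex &v) { return Vertex{v.x * 2.0f, v.y * 2.0f, v.z * 2.0f}; }
inline Color  lightOne(const Color &c)      { return Color{c.r * 0.5f, c.g * 0.5f, c.b * 0.5f}; }

// Batched: finish one operation over the whole array before starting the
// next, so each pass streams through a single contiguous block of like data.
void processBatched(std::vector<Vertex> &vertices, std::vector<Color> &colors)
{
    for (std::size_t i = 0; i < vertices.size(); ++i)
        vertices[i] = transformOne(vertices[i]);
    for (std::size_t i = 0; i < colors.size(); ++i)
        colors[i] = lightOne(colors[i]);
}

// Piecemeal: switch between operations (and arrays) for every element.
void processPiecemeal(std::vector<Vertex> &vertices, std::vector<Color> &colors)
{
    for (std::size_t i = 0; i < vertices.size() && i < colors.size(); ++i)
    {
        vertices[i] = transformOne(vertices[i]);
        colors[i] = lightOne(colors[i]);
    }
}
```

Both functions compute the same results for equally sized arrays. Which one is faster depends on the amount of data and the work per element, so the choice should be confirmed by measurement.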
6. Memory Throughput in the Graphics Subsystem

In order to understand the performance impact of the memory types in the
graphics subsystem, a basic understanding of the memory components and
transfer mechanisms in the system is required.



6.1 Main Memory and PCI 


Peripheral Component Interconnect (PCI) devices can map memory so that it
appears to the host memory controller like any other memory space [6]. This
mechanism is called Memory Mapped I/O (MMIO), and allows the memory controller
and CPU to read and write memory on a device, as well as to configure the
memory type of MMIO regions using the CPU's MTRRs.



In addition to the CPU being able to read or write directly to MMIO on a PCI 
device, the device itself can read or write directly to main memory using Direct 
Memory Access (DMA). A PCI device using DMA to access WB main memory 
addresses acts similarly to another CPU, in that the request is snooped, and all 
access is handled coherently. 


The PCI bus runs at 33 MHz standard, and 66 MHz in Fast PCI mode. At 4 
bytes (DWORD) per clock transferred, peak throughput is 132 MB/sec and 264 

MB/sec respectively. 


6.2 AGP 

At their heart, Accelerated Graphics Port (AGP) devices are an extension of PCI 
devices. Fast PCI is known as frame-based AGP, and is the base AGP protocol. 
However, AGP has several extensions beyond standard PCI that allow it to 
handle more data at a time with lower latency sensitivity. 























































































































































































































































































































In AGP 2x mode, data is transferred on the rising and falling edges of the clock, 

thereby doubling the peak data throughput to 528 MB/sec. 
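The peak figures quoted for PCI, Fast PCI, and AGP 2x all follow from the same product of clock rate, bytes per transfer, and transfers per clock. A small sketch of that arithmetic, using a hypothetical helper name and nominal clock rates:

```cpp
// Peak bus throughput in MB/sec: clock rate (MHz) x bytes per transfer x
// transfers per clock. Clock rates are the nominal 33 and 66 MHz values, and
// 1 MB is taken as 10^6 bytes.
constexpr unsigned peakMBPerSec(unsigned clockMHz, unsigned bytesPerTransfer,
                                unsigned transfersPerClock)
{
    return clockMHz * bytesPerTransfer * transfersPerClock;
}

constexpr unsigned pciPeak     = peakMBPerSec(33, 4, 1);  // standard PCI
constexpr unsigned fastPciPeak = peakMBPerSec(66, 4, 1);  // Fast PCI
constexpr unsigned agp2xPeak   = peakMBPerSec(66, 4, 2);  // both clock edges
```

These are theoretical ceilings; the measured blt throughput reported later in the paper is considerably lower.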





In addition to using DMA to access main memory, specialized non-local video 
memory (NLVM) can be allocated for specific use by AGP. The NLVM is 
memory type WC to the CPU, meaning that it can be written to quickly. This 
NLVM is coordinated on the AGP side by the Graphics Address Relocation Table 
(GART) that acts as a scatter-gather table to map multiple non-contiguous 
segments of main memory, into an apparently contiguous region of NLVM. All of 
main memory except for the last 12 MB can be utilized as NLVM by AGP. 
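The scatter-gather role of the GART can be illustrated with a simple page table model. This is a sketch under assumptions: the 4 KB page size, the 32-bit addresses, and every name below are chosen for illustration and are not taken from the AGP specification or from any driver.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a scatter-gather table: entry i holds the physical
// base address of the page backing page i of the apparently contiguous region.
struct GartSketch
{
    static const std::uint32_t pageSize = 4096;     // assumed page size
    std::vector<std::uint32_t> physicalPages;       // one physical base address per page

    // Translates an offset within the contiguous region to a physical address.
    std::uint32_t translate(std::uint32_t offset) const
    {
        const std::uint32_t page = offset / pageSize;
        return physicalPages.at(page) + offset % pageSize;
    }
};
```

For example, with three pages scattered through main memory, offsets 0, 4096, and 8192 of the contiguous region resolve to three unrelated physical addresses, while the graphics controller sees a single linear range.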






Since WC memory is a non-coherent memory type, and DMA can only reliably
access coherent memory, a separate mechanism is required to access NLVM. Two
such mechanisms exist in the AGP protocol: pipeline mode and sideband mode.
Pipeline mode shares a single set of lines for addresses and data. Sideband
mode uses address lines that are separate from the data lines. The selection
of pipeline or sideband mode depends purely on the AGP device's implementation.

The final AGP extension is the ability for the AGP device to execute data
directly out of main memory. Unlike DMA, which requests a copy of the data,
Direct Memory Execution (DMX, also known as DIME, and by a number of other
non-standard acronyms) allows NLVM to be treated like extended memory by the
graphics controller.


6.3 Data Transfer Choices 

With this large number of AGP protocol and transfer choices, the situation
becomes more complex, since a single device can use different transfer
methods depending on the operation the device is performing.


Table: transfer modes (MMIO, DMA/Fast PCI/Frame, Pipelined, Sideband) against
AGP 1x, AGP 2x, DMX, coherent, and non-coherent operation. Coherency of MMIO
depends on the MTRRs.
A graphics device could use MMIO for commands, DMA to get triangle primitive
data from main memory, and sidebands to access textures from NLVM using
DMX. This means that different types of data will be bounded by different peak
data rates.

7. Memory Blt Performance

One type of data transfer that is used extensively in graphics is the blt, or
bit block transfer. Memory surfaces were created in main memory (MM), NLVM,
and local video memory (LVM), and the Microsoft DirectDraw API Blt and BltFast
interfaces were used to copy between them. Blts between the surfaces were
timed in order to determine the guaranteed time (the time in which the call
returned) and the actual time (the time it took to complete copying the data
from the source to the destination surface). With three memory areas as source
and destination, this resulted in nine different test cases.


Three recent graphics controllers were chosen at random: 3D Labs Permedia 2, 
Intel i740, and the nVidia Riva TNT. The most recent drivers were obtained from 
their respective web sites (see appendix B). 

7.1 Experimental Setup 

A program, bltalyzer.exe, was created and used to characterize blt performance
on these three graphics controllers (see Collateral and Appendix A). The width
and height of square memory areas were varied, as was the number of blts
issued (queued) to the controller.

Timing data was taken using the CPU read timestamp counter (RDTSC)
instruction at three points during the test: before the first blt (start time),
after the blt(s) were issued (submission or guarantee time), and after the
blts had actually completed (completed time). The IDirectDrawSurface::Lock
function was used on the destination surface in order to determine when the
pending blt(s) actually completed. Each test was repeated ten times, and the
best timing was used for analysis.
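Converting a pair of timestamps into a throughput figure requires the CPU clock rate and the number of bytes moved. The following is a minimal sketch of that conversion; the function and parameter names are hypothetical and this is not code from bltalyzer.exe.

```cpp
#include <cstdint>

// Converts two timestamp-counter readings into MB/sec.
// bytesPerBlt = width * height * bytes per pixel, and cpuHz is the CPU clock
// rate (for example 450e6 for a 450 MHz processor). 1 MB is taken as 10^6
// bytes. All names here are assumptions for illustration.
double throughputMBPerSec(std::uint64_t startCycles, std::uint64_t endCycles,
                          std::uint64_t bytesPerBlt, unsigned queueDepth,
                          double cpuHz)
{
    const double seconds = static_cast<double>(endCycles - startCycles) / cpuHz;
    const double bytes = static_cast<double>(bytesPerBlt) * queueDepth;
    return bytes / seconds / 1.0e6;
}
```

Passing the submission timestamp as `endCycles` gives the guaranteed throughput, while passing the timestamp taken after the Lock call returns gives the completed throughput.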
for (DWORD w = dwwidth, h = dwheight;
     (w < (dwwidth + dwbytes)) && (h < (dwheight + dwbytes));
     w += 64, h += 64)
{
    for (DWORD dwqueuedepth = dwqueuemin;
         dwqueuedepth < dwqueuemax; dwqueuedepth++)
    {
        for (DWORD iter = 0; iter < dwIterations; iter++)
        {
            // increase priority

            GetTimeStamp(&qstart);

            for (DWORD q = 0; q < dwqueuedepth; q++)
                hrblt = pdest->BltFast(0, 0, psrc, &rect,
                                       DDBLTFAST_WAIT);

            GetTimeStamp(&qguarantee);

            hr = pdest->Lock(0, &ddsd, DDLOCK_SURFACEMEMORYPTR |
                             DDLOCK_WAIT, 0);
            pdest->Unlock(0);

            GetTimeStamp(&qend);

            // decrease priority
        }
    }
}

7.2 Charts Produced 

This setup of guaranteed time and completed time enabled the generation of two
graphs. The guaranteed time demonstrated how quickly the graphics controller
could accept blts, and the completed time demonstrated how long it actually
took the controller to complete processing the data movement operation.

Typically, the throughput implied by the submission (guaranteed) time was
significantly higher than the throughput actually achieved once the operation
completed.
3D Labs Permedia 2, LVM->LVM, Completed MB/sec

8. Comparative Main Memory to Main Memory Blt

The blt performance from one main memory surface to another is nearly
identical between the three controllers.

Chart legend: Permedia 2 guaranteed, Permedia 2 completed, i740 guaranteed,
i740 completed, TNT guaranteed, TNT completed















9. Local Video Memory Blt Performance

The guaranteed performance of the Permedia 2 continues to increase until the
33rd blt is queued, then performance drops precipitously. This could be due to
a number of issues, such as waiting for the blt queue to clear. The completed
throughput peaks at ~190 MB/sec.




3D Labs Permedia 2, LVM->LVM, MB/sec

3D Labs Permedia 2, LVM->LVM, 72x72 pixels, 1 - 64 blts






The TNT does not show the same type of performance degradation as more blts
are added. Performance increases until the 8th blt, and then levels off. The
completed throughput peaks at ~565 MB/sec.
The i740, however, operates similarly to the Permedia 2, in that the blt
performance increases until the 8th or 16th blt, then decreases. The completed
peak throughput is ~150 MB/sec.



10. Main Memory to LVM and NLVM Performance 

The performance of transfers to LVM and NLVM from main memory directly
impacts application decisions, including: How much can I blt each frame? And,
can I upload new data to the controller during game play?
In some cases, the bandwidth available to use for blitting may not be
during each frame.




Comparative MM->LVM 


A higher performing alternative might be to blt to a surface in NLVM, and let
the graphics controller pull the data from there.





Comparative, MM -> NLVM

Chart legend: TNT guaranteed, TNT completed, P2 guaranteed, P2 completed,
i740 guaranteed, i740 completed


11. Unanswered Questions 


Testing the blt performance between square memory regions in the graphics
subsystem only answers one set of questions. There is still much information to
be uncovered, including:

• Performance of blitting non-square areas.

• Performance of texture transfers between NLVM and LVM: up sampling, down
sampling, etc.

• Performance of primitive data transfers to the graphics controller, testing
type, number, format, etc.


Summary 


It is important to analyze the data usage, data layout, and data flow in time 
sensitive algorithms. Placing relevant data together can certainly complicate 
data structure layout, but it can also significantly improve performance by 
improving cache usage and reducing cache misses. 


With relevant data grouped together, batching processing will lead to higher 
performance gains instead of piecemeal processing in most cases. 

In order to determine whether an algorithm or data structure is truly better, 
profilers that utilize the CPU performance counters, as well as accurate timing, 
should be used to collect objective data. 

Collateral 

An electronic version of this paper, the presented foils, all data gathered and
analyzed, as well as the source code of both analysis programs, are available
at both the Game Developer Conference (GDC) web site at http://www.gdconf.com
and the Ensemble Studios web site at http://www.ensemblestudios.com,
under "Developer News."

The data files for data structure analysis are relatively small, however the data 
files for the graphics subsystem data transfer analysis comprise more than 30 
MB of spreadsheet data. 

All data is stored in Microsoft PowerPoint 97, Microsoft Excel 97, and Microsoft 
Word 97 formats. 
Appendix A

The bltalyzer.exe program for analyzing performance of the graphics subsystem
supports the following command-line parameters. Examples of the usage of
these parameters can be found in the supplied sample source code.

output=filename
    specify output filename
xres=x
    width of surfaces to allocate
yres=y
    height of surfaces to allocate
src=[mm|nlvm|lvm]
    type of memory to blt from
dest=[mm|nlvm|lvm]
    type of memory to blt to
op=[blt|bltfast]
    type of blt to perform
iter=n
    number of iterations to perform
bytes=n
    maximum number of bytes in width or height to blt
width=n
    width of starting blt
height=n
    height of starting blt
qmin=n
    minimum number of blts to queue
qmax=n
    maximum number of blts to queue
Appendix B

The following data was gathered from the respective graphics controllers using
the IDirectDraw::GetDeviceIdentifier function call.

3D Labs Permedia 2

driver        description
gldd32.dll    Diamond FIRE GL 1000 PRO

product  ver  subver  bld   vendor  device  subsys     rev
4        10   1       2359  4172    15623   22286482   1

Intel i740

driver        description
R3Dd32M.dll   Starfighter AGP Release 0322, Standard

product  ver  subver  bld   vendor  device  subsys     rev
4        10   1       322   32902   30720   524349     33

nVidia Riva TNT

driver        description
NV4DD32.DLL   NVidia RIVA TNT

product  ver  subver  bld   vendor  device  subsys     rev
4        10   1       48    4318    32      659099828  4
Experimental platforms

Intel Pentium II, 400 MHz and 450 MHz
L1 cache: 16k/16k
L2 cache: 512 KB
Intel 440bx chipset
Main memory: 256 MB





L1 cache (16k/16k), L2 cache (512 KB), 440bx, main memory

32 bytes / cache line
CPU can access MM @ 800 MB/sec
CPU can access L2 cache @ 1.3 GB/sec
~4 CPU clocks to access L2
~39 CPU clocks to access MM

Original Celerons don't have L2 cache
Supercomputers don't have cache memory, either
Worry about making your L2 access better
Don't worry about the L1
- there's really nothing you can do there












Memory Type Range Registers (MTRRs)
- Controls types of all memory spaces

Write-Back (WB)
- Used by most of MM
- Cache lookups. Line reads
- MESI line transitions
  • Modified, Exclusive, Shared, Invalid

Uncacheable (UC)
- No cache lookup
- Reads not turned into line reads
- Posted writes. No speculative reads


































































































































































Getting data from MM is slow 

Use data from cache for best performance 

Experimental data 

- 4x3 matrix transform of (n = 2000) vertices 

- 5 different data structures 

- vTune 3.x pause/resume API 

• Data Memory References (all) 

• L2 Cache Request Misses (highly correlated) 

• L2 Cache Requests 

- RDTSC (read timestamp counter) 







Structures 


struct s3
{
    float x, y, z;
    float vx, vy, vz;
    float nx, ny, nz;
    float tu, tv;
    float r, g, b;
    float sr, sg, sb;
};

struct s4
{
    float *px;
    float *py;
    float *pz;
    float *pvx;
    float *pvy;
    float *pvz;
};


Experimental Setup (pseudocode)

void processS1(void)
{
    s1 *psrc = new s1[2000];
    s1 *pdest = new s1[2000];

    increasePriority();
    VtResumeSampling();    // or RDTSC

    do
    {
        // inline matrix multiply
    } while (count < cVertices);

    VtPauseSampling();     // or RDTSC
    resumePriority();
}

for (i = 0; i < 10; i++)
{
    processS1();
    invalidateCache();
}






Tringl 

19456 


TringO CPU 



1,802,704 


7501 

19724 


453 


1,803,839 


6743 



19150 


2,297,085 


7530 

20027 


371 

284 


1,987,027 


12574 

20042 


347 


2,256,448 


7518 

20034 

452 

15014 


446 

162 


675 


1,992,929 






























void processS1(void)
{
    s1 *psrc = new s1[2000];
    s1 *pdest = new s1[2000];

    DWORD count = 0;       // one more variable

    s1 *p = psrc;
    s1 *d = pdest;

    do
    {
        // inline matrix multiply
        p++;
        d++;
        count++;
    } while (count < cVertices);
}

void processS1a(void)
{
    s1 *psrc = new s1[2000];
    s1 *pdest = new s1[2000];

    s1 *p = psrc + 2000 - 1;
    s1 *d = pdest + 2000 - 1;

    do
    {
        // inline matrix multiply
        p--;
        d--;
    } while (p >= psrc);
}
























struct s3
{
    float x, y, z;
    float vx, vy, vz;
    float nx, ny, nz;
    float tu, tv;
    float r, g, b;
    float sr, sg, sb;
};

struct Vertex   { float x, y, z; };
struct TVertex  { float vx, vy, vz; };
struct Normal   { float nx, ny, nz; };
struct Color    { float r, g, b; };
struct Specular { float sr, sg, sb; };
struct TxCoord  { float tu, tv; };

struct Object
{
    Vertex *pvertices;
    TVertex *ptvertices;
    Normal *pnormals;
    TxCoord *ptxcoords;
    Color *pcolors;
    Specular *pspeculars;
};


You don’t need all the data in a vertex all the time! 







Backface cull 

- normal (to face)

Transform vertices 

- vertex, transformed vertex 

Lighting 

- color, specular, normal 

Projection 

- transformed vertex, projected vertex 






















Diagram: vertex data layout and the resulting cache misses
Diagram labels: Main Memory, CPU, Chipset/Memory Controller, MMIO, DMA,
Graphics Controller, NIC

Memory Mapped IO (MMIO)
- CPU can read/write mapped memory on device
- CPU pushes/pulls data to/from device

Direct Memory Access (DMA)
- PCI device can read/write main memory
- CPU doesn't need to be involved
- Coherent - MC ensures relevant CPU caches written before device reads

33 MHz for consumer, 66 MHz (aka "Fast PCI") for server / workstations

Bandwidth @ 33 MHz
- 1 DWORD (4 bytes) per clock, 132 MB/sec














































Diagram labels: Chipset/Memory Controller, DMA, AGP













































Diagram labels: Request data, Chipset/MC, Graphics Controller


• Pipeline Mode (PIPE) 

- Shared address and data line 

- GC can either request data or receive data 

• think of it like half-duplex 

- GC can have multiple outstanding requests 
Diagram labels: Chipset/MC, Graphics Controller, address lines (to request
data), data lines (to receive data)

Side-bands (SB)
- Separate address and data lines
- can simultaneously request and receive
  • think of it like full-duplex
- can have multiple outstanding requests


















Maximizing AGP Performance 




























http://www.agpforum.org















































































































Even with PIPE or SB, still need to copy 


data to use (a la DMA) 



Direct Memory Execute 

Ability of GC to treat MM as extended memory

think of it like XMS 

MM memory space is treated as extension of 
local video memory (LVM) 

GC can see LVM and NLVM as single 

contiguous memory area 









Choices

A good AGP platform has:

High performing chipset/MC 
High performing GC 

Performance will be bounded by lower 
performing component 




Table: Transfer Mode against AGP 1x, AGP 2x, DMX, Coherent, non-Coherent

Can mix-and-match transfer methods
- MMIO for commands
- DMA triangles from MM
- Sideband textures from MM using DMX

Note: This means that different data will be
bounded by different bandwidth rates
Main Memory to the CPU




























Optimizing Data Structures 






























Batching Processing 




















































































































Unanswered Questions 
From\To    MM    NLVM    LVM
MM
NLVM
LVM

















Test Loop 


for (DWORD w = dwwidth, h = dwheight;
     (w < (dwwidth + dwbytes)) && (h < (dwheight + dwbytes));
     w += 64, h += 64)
{
    for (DWORD dwqueuedepth = dwqueuemin; dwqueuedepth < dwqueuemax; dwqueuedepth++)
    {
        for (DWORD iter = 0; iter < dwIterations; iter++)
        {
            // increase priority

            // Start Blts
            GetTimeStamp(&qstart);

            for (DWORD q = 0; q < dwqueuedepth; q++)
                hrblt = pdest->BltFast(0, 0, psrc, &rect, DDBLTFAST_WAIT);

            // Submission Completed
            GetTimeStamp(&qguarantee);

            hr = pdest->Lock(0, &ddsd, DDLOCK_SURFACEMEMORYPTR | DDLOCK_WAIT, 0);
            pdest->Unlock(0);

            // Blts Actually Complete
            GetTimeStamp(&qend);

            // decrease priority
        }
    }
}









LVM2LVM, MB/sec: guaranteed and completed
































































































































BLT of non-uniform areas
- thin horizontal or vertical areas

Performance of texture transfers from NLVM to LVM (even LVM to LVM)
- BLT may approximate (or even equal) performance
  for GC that don't re-arrange texture contents
  • higher locality
- Specific test required
  • test GC texture throughput
  • Size, direct sampling, up sampling, down sampling
  • MIPmaps

Performance of primitive data transfers
- type, number, format
This presentation ...
- paper, foils, 2 demo programs, 40+ MB of spreadsheets
- http://www.gdconf.com/...
- http://www.ensemblestudios.com/... , "Developer News"

AGP
- AGP Implementors Forum (http://www.agpforum.com)

CPU & AGP Information
- http://developer.intel.com

vTune
- http://developer.intel.com/vtune